A  version  of  the  fast  multipole  method  (FMM)  is  described  for  charge  distributions  on 
the  line.  Previously  published  schemes  of  this  type  relied  either  on  analytical  represen¬ 
tations  of  the  potentials  to  be  evaluated  (multipoles,  Legendre  expansions,  Taylor  series, 
etc.),  or  on  the  Singular  Value  Decomposition  (SVD);  in  contrast,  the  algorithm  of  this 
paper  utilizes  the  matrix  compression  scheme  described  in  H.  Cheng,  Z.  Gimbutas,  P.  G. 
Martinsson,  and  V.  Rokhlin.  “On  the  compression  of  low  rank  matrices”  SIAM  J.  Sci. 
Comput.  26(4)0389-1404,  2005.,  resulting  in  substantial  improvements  in  the  CPU  time 
requirements.  Furthermore,  the  scheme  of  this  paper  is  applicable  to  a  wide  variety  of 
potentials;  in  this  respect,  it  is  similar  to  the  SVD-based  FMMs.  The  performance  of  the 
scheme  is  illustrated  with  several  numerical  examples. 
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An  accelerated  kernel-independent  fast  multipole 
method  in  one  dimension 

P.G.  Martinsson  and  V.  Rokhlin 

1.  Introduction 

We  consider  the  problem  of  rapidly  evaluating  the  sum 
N 

(1.1)  um  =  Y^K(xm,xn)qn,  for  m  =  1,. . N, 

n=l 

given  a  set  of  points  (xn)^=1  in  M,  a  kernel  K  that  is  smooth  away  from 
the  diagonal,  and  a  set  of  real  or  complex  numbers  ( qn)n=i ■  We  refer  to  the 
numbers  qn  as  “sources” ,  the  numbers  un  as  “potentials” ,  and  the  points  xn 
as  “source  locations” . 

The  sum  (1.1)  can  be  straight-forwardly  evaluated  using  0(N2)  floating 
point  operations.  In  many  applications,  this  cost  is  prohibitively  large,  and 
a  number  of  methods  that  reduce  the  cost  to  0{N )  or  0(N  logK  N)  have 
been  developed,  [1,  8].  In  this  paper,  we  describe  an  0(N )  method  that  is 
a  development  of  the  fast  multipole  method  [8,  9].  The  changes  introduced 
enable  the  application  of  the  method  to  a  large  class  of  kernels,  and  makes 
the  method  significantly  faster  than  existing  methods  for  the  case  where 
the  sum  (1.1)  has  to  be  evaluated  several  times  for  a  fixed  set  of  locations 
{xn}n=i  (see  Section  6). 

There  are  two  principal  differences  between  the  method  presented  here 
and  the  original  fast  multipole  method  of  [8],  The  first  concerns  the  way 
sources  and  potentials  are  represented.  While  the  fast  multipole  method  re¬ 
lies  on  an  analytic  properties  of  the  kernel,  the  method  presented  here  relies 
on  a  pre-computation  step  that  adaptively  constructs  representations  that 
are  optimized  for  the  given  kernel,  and  the  given  source  locations.  The  sec¬ 
ond  difference  concerns  the  treatment  of  the  interactions  between  adjacent 
sub-intervals  on  the  finest  level  of  sub-division.  While  the  original  fast  mul¬ 
tipole  method  evaluates  such  interactions  directly,  the  representations  used 
by  the  method  of  this  paper  enables  the  compression  of  such  interactions. 

The  method  presented  here  is  similar  to  the  methods  of  [7,  14,  23]  in 
that  they  all  rely  on  adaptively  computed  representations,  and  thus  work 
for  a  wide  range  of  kernels.  While  the  methods  of  [7,  23]  rely  on  the  singu¬ 
lar  value  decomposition  to  compress  the  kernel,  and  [14]  relies  on  the  QR- 
decomposition,  this  paper  is  based  on  a  technique  described  in  [4,  15,  16] 
that  we  refer  to  as  “skeletonization”.  Similar  techniques  were  previously 
used  in  [17,  18,  22]. 

This  paper  is  structured  as  follows:  Section  2  lists  some  results  from  nu¬ 
merical  linear  algebra  that  will  be  of  use.  Section  3  describes  the  adaptive 
technique  for  the  representation  of  sources  and  potentials  that  the  current 
method  relies  on.  Section  4  describes  a  fast  summation  scheme  based  on 
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the  same  data  structures  as  the  original  fast  multipole  method,  but  using 
the  representation  described  in  Section  3.  Section  5  describes  an  additional 
acceleration  of  the  method  based  on  compressing  interactions  between  all 
boxes,  including  adjacent  ones.  Section  6  reports  the  results  of  several  nu¬ 
merical  experiments,  and  Section  7  summarizes  the  results. 

Remark  1.  This  paper  deals  with  the  one-dimensional  case.  Extensions  of 
the  method  to  higher  dimensions  are  under  investigation  and  will  be  reported 
at  a  later  date. 


2.  Preliminaries 


The  fast  summation  technique  described  in  this  paper  achieves  acceler¬ 
ation  by  approximating  off-diagonal  blocks  of  the  matrix  [K{xm,  £n)]^n=1 
in  (1.1)  by  low-rank  matrices.  The  particular  matrix  factorization  we  use  to 
represent  the  low-rank  matrix  was  described  in  [12]  and  [4].  The  following 
lemma  summarizes  the  facts  needed  for  our  purposes. 

Lemma  1.  Let  A  be  an  M  x  N  matrix  of  rank  k  with  columns  C\, ... ,  CV 
and  rows  R\, ... ,  Rm,  so  that 


Then 

(2.1) 


A  =  [C1---Cn}  = 


Ri 

Rm 


A  =  Aco\  o  proj, 


where  proj  is  a  kxN  matrix  that  contains  the  kxk  identity  as  a  submatrix, 
and  where  j4coi  is  an  M  x  k  matrix  consisting  of  k  columns  of  A, 

-^col  =  [f-'ni  >  •  •  •  j  Cn*J  ■ 


Furthermore, 


(2.2) 


A  =  eval  o  v4row, 


where  eval  is  an  M  xk  matrix  that  contains  the  kxk  identity  as  a  submatrix, 
and  where  Aro w  is  a  k  x  N  matrix  consisting  of  k  rows  of  A, 


A 


row 


Moreover,  no  elements  of  eval  or  proj  have  magnitude  larger  than  1 . 


Remark  2.  Since  the  matrix  proj  contains  a  kxk  identity  matrix  it  can  be 
applied  to  a  vector  using  k  x  (N— k)  multiplications  and  additions.  Similarly, 
the  matrix  eval  can  be  applied  to  a  vector  using  k  x  ( M  —  k )  multiplications 
and  additions. 
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Remark  3.  As  a  direct  consequence  of  Lemma  1,  the  condition  number 
of  proj  is  bounded  by  i/l  +  k(N  —  k )  and  the  condition  number  of  eval  is 
bounded  by  a/1  +  k(M  —  k). 

Remark  4.  For  simplicity,  Lemma  1  is  stated  only  for  the  case  where  the 
matrix  A  has  exact  rank  k.  Similar  factorizations  can  be  constructed  for 
matrices  of  approximate  rank  k,  see  [4]. 

Remark  5.  Methods  for  computing  the  factorizations  in  Lemma  1  are  de¬ 
scribed  in  [4,  12].  The  computational  cost  is  typically  O(MNk)  but  can  in 
rare  cases  be  slightly  higher. 

3.  Representation  of  functions  via  tabulation 

3.1.  Outline.  At  the  core  of  the  original  fast  multipole  method  is  a  tech¬ 
nique  for  compactly  representing  sources  and  potentials.  A  source  distribu¬ 
tion  inside  a  box  is  represented  by  a  multipole  expansion  about  the  center 
of  the  box.  From  this  expansion,  the  potential  caused  by  the  source  dis¬ 
tribution  at  distant  target  points  can  be  evaluated.  For  a  given  accuracy, 
only  a  small  number  of  terms  in  the  expansion  are  needed,  regardless  of 
how  many  sources  made  up  the  original  distribution.  In  the  same  way  that 
source  distributions  are  efficiently  represented  via  multipole  expansions,  po¬ 
tentials  are  represented  by  giving  the  expansion  coefficients  in  an  expansion 
in  harmonic  polynomials. 

In  this  section,  we  describe  an  alternative  technique  for  representing 
sources  and  potentials.  In  order  to  describe  the  technique,  we  consider  a  sim¬ 
ple  model  problem:  Assume  that  we  are  given  N  source  locations  (yn)n=i  *n 
a  set  b,  M  target  locations  (im)^=1  in  a  set  a,  an  interaction  kernel  K(x,  y), 
and  that  for  a  given  vector  of  sources  qb  =  (gn)/=1,  we  wish  to  determine 
the  vector  of  potentials  ua  =  {um)m=i  given  by 

(3.1)  ua  =  direct(o,  b)  qb, 

where  direct(a,  b)  is  the  M  x  N  matrix  with  entries  K(xm,  yn).  We  demon¬ 
strate  that  if  the  matrix  direct(a,  b)  has  rank  k  (to  within  some  precision 
e),  then  it  is  possible  to  choose  a  subset  of  k  source  locations  (ynj)j=i  with 
the  property  that  the  potential  ua  can  be  replicated  at  all  source  points  by 
placing  some  “proxy”  sources  on  the  points  ynj .  These  proxy  sources  form 
the  representation  of  the  original  source  distribution.  Similarly,  it  is  possible 
to  choose  a  subset  of  k  target  locations  (xmi)i= i  with  the  property  that  if 
the  potential  is  known  at  these  k  points,  then  it  can  be  interpolated  to  all 
the  remaining  points.  The  potentials  (umi)-=1  form  the  representation  for 
the  potential  ua. 

The  representation  technique  described  in  this  section  has  two  principal 
advantages  over  techniques  based  on  analytic  methods,  such  as  multipole 
expansions:  (1)  It  is  cheaper  to  construct  the  proxy  sources  that  represent 
a  source  distribution  than  it  is  to  compute  the  corresponding  multipole 
expansion  (or  any  other  similar  representation).  (2)  The  operator  that  maps 
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FIGURE  1.  The  computational  domain  described  in  Section  3.2. 


a  representation  for  a  source  distribution  to  a  representation  for  a  potential 
(sometimes  called  a  “translation  operator”)  is  a  submatrix  of  the  original 
matrix  of  interaction  direct(a,  6).  As  a  result,  this  operator  need  never  be 
separately  constructed  or  stored. 

3.2.  Notation.  We  assume  that  we  are  given  a  computational  domain  fl, 
containing  a  number  of  source  locations.  For  a  given  subset  b  C  SI,  we  let 
(yn)n=i  denote  the  locations  inside  6,  and  we  let  qb  =  (qn)n=i  denote  a  set 
of  sources  located  at  the  points  (yn)n= i-  We  let  a  denote  the  set  of  all  points 
x  that  are  well-separated,  from  b,  meaning  that  dist (x,b)  >  diam(fe),  where 
diam(6)  =  supy  2//eb  \y  -  y'\.  We  let  (xm)^=1  denote  the  locations  inside  a, 
and  let  ua  =  (um)^=l  denote  the  potential  on  a  induced  by  the  charges  qb 
(so  that  ua  and  qb  satisfy  (3.1)).  See  Figure  1. 

3.3.  Construction  of  the  representation.  We  first  consider  the  task  of 
evaluating  a  potential  in  a  caused  by  a  set  of  charges  in  b.  To  this  end, 
we  form  the  M  x  N  matrix  A  =  direct  (a,  b)  with  entries  K(xm,yn),  we 
determine  its  e-rank  k ,  and  form  a  k  x  N  matrix  proj(6),  and  an  index 
vector  (nj)^=1  such  that,  cf.  (2.1), 

A  =  Acoi  o  proj(fc)  +  0(e), 

where  Aco\  is  the  M  x  k  matrix  whose  j’th  column  is  the  rij’th  column  of 
A.  Then  given  any  charge  distribution  qb  on  6,  we  form  the  vector 

(3.2)  ^b  =  pro](b)qb  eCk. 

The  vector  ipb  has  the  property  that  the  potential  in  a  caused  by  the  charge 
distribution  qb  can  to  within  precision  e  be  reconstructed  from  ipb,  since 

(3.3)  ua  =  A  qb  =  (Aco\  o  proj(6)  +  0(e))  qb  =  Aco j  ipb  -I-  O(e). 

We  say  that  ipb  is  the  outgoing  representation  of  qb,  and  that  the  points 
(ynj  )k=1  form  the  outgoing  skeleton  of  b. 

We  next  consider  the  task  of  evaluating  a  potential  in  b  caused  by  a  charge 
distribution  in  a.  To  this  end,  we  form  the  N  x  M  matrix  B  =  direct  (6,  a) 
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with  entries  K(yn , xm),  we  determine  its  e-rank  k,  and  form  an  Nxk  matrix 
eval(6),  and  an  index  vector  (rij)* :=1  such  that 

B  —  eval(6)  o  J5row  +  O(e), 

where  2?row  is  the  kx  M  matrix  whose  i’th  row  is  the  rij’th  row  of  B.  Then 
given  any  distribution  qa  of  charges  on  a,  we  form  the  vector 

(3.4)  4>b  =  Brow  qa  6  Ck. 

The  vector  4>b  has  the  property  that  the  potential  on  b  can  to  within  precision 
e  be  reconstructed  from  <f>b  only,  since 

(3.5)  ub  =  Bqa  —  (eval(b)  o  £?row  +  0(e))  qq  =  eval (b)  (f)b  +  0(e). 

We  say  that  4>b  is  the  incoming  representation  of  ub,  and  that  the  points 
(ym)i= i  form  the  incoming  skeleton  of  b. 

Remark  6.  The  numbers  in  the  vectors  ipb  and  (pb  admit  simple  heuristic 
interpretations:  Writing  out  equation  (3.3)  componentwise,  we  find  that 
k 

(3.6)  u^n  =  ^2K(xm,ynj)xpb  +  0(e),  m  =  l,...,M. 

j-i 

In  other  words,  the  numbers  ipb  can  be  interpreted  as  charges  that  replicate 
the  potential  ua  when  placed  at  the  skeleton  points  yUj .  Analogously,  writing 
out  equation  (3.4)  componentwise,  we  find  that 
M 

(3.7)  4>bi  =  ^K(yni,xm)q^n  +  0(e)  =  ubn.  +  °(e),  i  =  l,...,k. 

m=l 

In  other  words,  the  number  <j>\  is  simply  the  potential  at  the  point  y7H. 

Remark  7.  For  many  kernels  it  is  possible  to  prove  that  when  two  sets 
a  and  b  are  separated  by  some  finite  distance,  there  exist  basis  functions 
{fjYj=i  and  {gj}p1=i  such  that 

v 

(3.8)  sup  | K(x,y)~Yl  fj(x)gj(y)  \  <  e, 

(x,y)£axb  J=1 

where  p  tends  to  scale  as  a  small  power  of  |  log  er |  as  e  — »  0.  When  (3.8) 
holds,  the  number  p  provides  an  upper  bound  on  the  numerical  rank  of  the 
matrix  of  interaction  direct  (a,  6),  regardless  of  the  number  of  target  and 
source  points  and  their  locations.  Moreover,  when  a  formula  like  (3.8)  can 
be  constructed  using  analytical  properties  of  the  kernel,  it  can  be  used  to 
accelerate  the  computation  of  proj  and  eval,  cf.  Section  3.6. 

Remark  8.  In  most  environments,  it  is  possible  to  construct  the  represen¬ 
tations  (3.3)  and  (3.5)  in  such  a  manner  that  the  outgoing  and  the  incoming 
skeletons  are  identical,  see  [4].  This  causes  only  a  minor  increase  in  the 
number  p. 
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3.4.  Converting  outgoing  to  incoming  representations.  In  this  sec¬ 
tion,  we  construct  a  matrix  that  converts  the  outgoing  representation  for  a 
set  of  charges  in  one  box,  to  an  incoming  representation  for  the  potential 
induced  by  this  set  of  charges  in  a  different  box. 

Given  two  well-separated  boxes  a  and  b  in  fl,  suppose  that  we  are  given  the 
outgoing  representation  V’6  =  (V’j)jLi  for  a  charge  distribution  qb  in  b ,  and 
that  we  wish  to  determine  the  incoming  representation  (f)a  for  the  potential 
ua  induced  by  qb.  We  let  (xm)^=1  and  ( yn)n-\  denote  the  locations  in  a  and 
b ,  and  we  let  (xmi)i~i  and  (ynj)jL  1  denote  the  corresponding  skeletons.  Then 
equation  (3.7)  implies  that  <^>f  =  u^.  +  0(e),  and  equation  (3.6)  implies  that 

unH  —  Sjii  K (xmi ,  Vrij )  V’j  +  0(e).  In  other  words,  letting  oi(a,  b)  denote 
the  ka  x  kb  matrix  with  entries  K(xmi,ynj),  we  find  that 

(3.9)  <pa  =  oi(a,  b)  +  0(e). 

We  refer  to  the  matrix  oi(a,  b)  as  an  outgoing-to-incoming  translation  op¬ 
erator.  This  operator  is  a  submatrix  of  direct(a,  b).  One  ramification  of 
this  fact  is  that  oi(a,  b)  need  not  be  explicitly  constructed.  Since  the  kernel 
K(x,y)  is  known,  oi  (a,  6)  is  uniquely  defined  by  the  incoming  and  outgoing 
skeletons  of  a  and  b. 

Remark  9.  The  matrix  direct(a,  b)  is  related  to  the  matrices  eval(a),  oi(a,  b), 
and  proj(fr)  via  the  relation 

(3.10)  direct(a,  b)  =  eval(a)  o  oi(a,  b)  o  proj(6)  +  0(e). 

To  prove  (3.10),  we  first  note  that  direct(a,  b)  is  defined  by  the  relation 

(3.11)  ua  =  direct  (a,b)qb. 

Moreover,  (3.5),  (3.9),  and  (3.2),  imply  that 

(3.12)  ua  =  eval(o)  cf>a  -|-  0(e), 

(3.13)  <j)a  =  oi(a,b)ipb  +  0(e), 

(3.14)  ipb  =  proj(6)  qb, 

respectively.  Since  (3.11)  must  hold  for  every  qb,  the  equations  (3.11),  (3.12), 
(3.13),  and  (3.14)  together  imply  (3.10).  (We  remark  that  equation  (3.10) 
is  analogous  to  equation  (3.1)  in  [4].) 

3.5.  Merging  the  representations  of  two  boxes.  In  this  section  we 
describe  a  procedure  for  the  construction  of  the  outgoing  representation  of 
a  set  b  if  the  outgoing  representations  are  known  for  two  sets  bi  and  62  such 
that  b  =  bi  U  62. 

We  let  a  denote  the  set  of  points  that  are  well-separated  from  b,  and  for 
j  —  1,2,  we  similarly  let  cij  denote  the  set  of  points  that  are  well-separated 
from  bj.  It  follows  that  a  C  a,j.  Typically,  a  is  strictly  smaller  than  aj. 

We  let  ua  denote  the  potential  induced  on  the  locations  (xm)^=1  c  a  by 
two  given  charge  distributions  in  b\  and  62  with  outgoing  representations  ipO) 
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and  ip(2\  Furthermore,  we  let  y^  and  y®  be  the  corresponding  outgoing 
skeletons.  (We  recall  that  y^  and  y®  are  subsets  of  the  original  sets 
of  locations  in  61  and  62.)  Merging  these  two  pairs  of  vectors,  we  form 
■tp  =  [ipWj'ipW],  and  y  —  [yd), y(2)].  Since  a  is  wholly  contained  in  flj  D  02, 
equation  (3.6)  implies  that 

ki+k2 

(3.15)  uam  =  K(xm,  yj)  ipj  +  0(g),  for  m  =  1, . . . ,  M, 

3- 1 

where  k\  and  k%  are  the  lengths  of  ip^  and  We  rewrite  equation  (3.15) 
as  a  matrix-multiplication  by  introducing  the  M  x  (k\  +  k^)  matrix  A  with 
entries  Amj  =  K(xm,yj), 

(3.16)  u“  =  A^  +  0(e). 

Equation  (3.15)  says  that  ^  is  a  valid  outgoing  representation  for  6  with 
an  associated  outgoing  skeleton  y.  However,  when  a  contains  fewer  target 
locations  than  01  U  a-i  (which  is  typically  the  case),  this  representation  is 
likely  to  be  longer  than  necessary.  To  be  precise,  the  length  of  an  optimal 
representation  equals  the  e-rank  of  the  matrix  A  in  (3.16),  which  we  denote 
by  k.  To  obtain  an  optimal  representation,  we  factor  A  as  in  (2.1), 

A  =  Aco\  o  Z  +  O(e), 

where  Z  is  a  k  x  (Aq  +  £2)  matrix,  and  Aco\  consists  of  k  columns  of  A.  We 
denote  the  indices  of  these  by  (ji)f=1.  An  optimal  outgoing  representation 
is  then  the  vector  ipb  =  Z  ip;  it  is  associated  with  the  outgoing  skeleton 
(Vji)i= i-  Letting  00(6,61)  denote  the  matrix  formed  by  the  first  k\  columns 
of  Z ,  and  letting  the  remaining  columns  form  00(6, 62),  we  find  that 

ipb  =  00  (6, 61)V’(1)  +  00(6,  62)  t^2). 

Analogously,  we  construct  matrices  ii(6i,6)  and  ii (62 , 6)  that  construct 
the  incoming  representations  0d)  and  <f>^  for  61  and  62  from  the  incoming 
representation  of  6  via  the  formulas 

(3.17)  ^>d)  =  ii(61)  6)  (pb,  and  <f>^  =  ii(&2, 6)  4>b. 

We  refer  to  the  matrix  00(6,61)  as  an  outgoing-to-outgoing  translation 
operator,  while  the  matrix  ii(6i,  6)  is  referred  to  as  an  incoming-to-incoming 
translation  operator. 

3.6.  Efficient  construction  of  translation  operators.  The  techniques 
for  constructing  translation  operators  that  were  given  in  Sections  3.3,  3.4, 
and  3.5  could  potentially  be  quite  expensive.  For  instance,  when  computing 
the  operator  proj(6)  in  Section  3.3,  we  considered  the  interaction  between 
the  set  of  locations  in  6,  and  the  set  of  all  locations  in  Q  that  are  well- 
separated  from  6.  The  second  set  is  typically  very  large.  However,  in  most 
instances  of  practical  interest,  the  process  can  be  accelerated  by  replacing 
this  very  large  set,  by  a  small  set  of  pre-determined  locations  in  a  that  act 
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as  proxies  for  the  actual  charge  locations.  This  acceleration  technique  works 
for  kernels  satisfying  the  following: 

Assumption  I:  Let  b  denote  a  subset  of  a  computational  domain  Q,  and  let 
a  denote  the  set  of  points  in  U  that  are  well-separated  from  b.  We  assume 
that  for  any  positive  number  e,  there  exist  interpolation  points  {zi)pi=l  C  a 
and  functions  such  that 

p 

(3.18)  sup  sup \K  (x,  y)  ~y2<pi(x)K(zi,y)\  <  e. 

xea  yeb  i=1 

The  number  of  terms  required,  p,  is  assumed  to  satisfy 

p<  C|loge|, 

as  s  — >  0.  For  non-symmetric  kernels,  we  also  assume  that  there  exist  points 
(wi) f=i  C  a  and  functions  (ipz)p=l  such  hat 

p 

(3.19)  sup  sup  \K(y,  x)  -'Y]K(y,wi)^i(x)  |  <  e. 

x€a  yeb  j_2 

Remark  10.  It  is  shown  in  [16]  that  Assumption  I  holds  for  any  kernel  that 
is  separable  in  the  sense  described  in  Remark  7. 

Suppose  now  that  the  kernel  K  satisfies  Assumption  I.  Then  for  for  any 
given  set  6  C  Cl,  one  can  quite  inexpensively  construct  an  outgoing  represen¬ 
tation  that  is  valid  for  evaluating  potentials  at  the  points  z\.  Assumption  I 
then  assures  us  that  this  representation  is  also  valid  at  any  other  point  that 
is  well-separated  from  b,  since  the  potential  there  can  be  interpolated  locally 
from  the  potential  at  the  zfs.  Incoming  representations  can  be  constructed 
analogously. 

Remark  11.  In  general,  the  cost  of  constructing  the  matrix  proj(fr)  and 
determining  the  outgoing  skeleton  (ynj)j= i,  is  0(kNM )  where  k  is  the  e- 
rank  of  interaction,  N  is  the  number  of  points  in  6,  and  M  is  the  number 
of  locations  that  are  well-separated  from  b.  When  the  kernel  K  satisfies 
Assumption  I,  this  cost  can  be  reduced  to  0(kNp),  with  no  dependence  on 
M. 

3.7.  Representations  with  extended  domains  of  validity.  For  a  given 
box  b,  we  have  so  far  constructed  outgoing  and  incoming  representations 
that  are  valid  with  respect  to  locations  in  the  computational  domain  that 
are  separated  from  a  box  b  by  at  least  the  diameter  of  b.  This  require¬ 
ment  of  a  “buffer”  zone  is  a  standard  feature  of  fast  summation  techniques. 
However,  the  particular  technique  for  representing  functions  described  in 
Section  3.3  can  also  be  used  to  compress  the  interaction  between  adjacent 
boxes.  The  resulting  representations  are  very  similar  to  the  ones  constructed 
for  the  interactions  between  separated  boxes.  However,  the  expansions  are 
longer,  and  should  not  be  used  unless  necessary.  We  call  such  unbuffered 
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representations  “rich  representations” ,  and  denote  the  outgoing  and  the  in¬ 
coming  rich  representations  by  and  <£b,  respectively.  The  corresponding 
projection  and  evaluation  operators  are  called  Proj,  and  Eval  (capitalized), 
respectively,  so  that  for  a  given  box  b 

(3.20)  *6  =  Proj  (6)  g6, 

(3.21)  ub  =  Eval(6)  $6. 

The  representation  '3/ b  in  (3.20)  can  be  used  to  compute  the  potential  in¬ 
duced  by  qb  at  any  location  outside  b,  while  the  representation  may  be 
used  to  reconstruct  any  potential  ub  induced  by  a  distribution  of  charges  at 
any  of  the  locations  outside  b. 

Remark  12.  In  the  original  FMM,  and  its  variants,  the  existence  of  a  buffer 
between  source  and  target  boxes  was  necessary  to  assure  that  the  classical 
representations  (multipole  series,  etc.)  are  valid,  and  have  controlled  con¬ 
vergence  rates.  Here,  we  eliminate  the  need  for  the  buffers  by  constructing 
the  representations  via  numerical  techniques  (as  opposed  to  using  analytic 
properties  of  the  kernel) .  Such  representations  are  possible  because  they  are 
not  required  to  be  valid  everywhere  in  the  boxes,  but  only  at  a  finite  number 
of  source  and  target  points. 

Remark  13.  If  for  a  given  box  b,  we  wish  to  compute  both  its  rich  and  its 
regular  outgoing  representations,  the  most  efficient  way  for  doing  so  is  to 
first  compute  the  rich  representation  \Efb,  and  then  compute  the  regular  one, 
ipb,  from  it. 

In  order  to  construct  the  matrix  that  maps  to  we  first  construct 
the  matrix  A  with  entries  Amj  =  K(xm,yj),  where  (yj)f=  i  are  the  points 
in  the  rich  skeleton,  and  (rm)^=1  is  a  set  of  well-separated  target  points. 
Letting  k  denote  the  e-rank  of  A,  we  factor  A  as  in  (2.1), 

A  =  Acoi  o  proj (6)  +  0(e), 

where  Aco\  is  the  M  x  k  matrix  consisting  of  the  columns  of  A  with  indices 
(jx)i=i>  and  proj (6)  is  a  k  x  K  matrix.  If  ’F6  is  an  outgoing  representation 
associated  with  the  rich  outgoing  skeleton  (yj)f=  i,  then 

=  proj  (b)  b 

is  a  regular  outgoing  representation  for  b  associated  with  the  outgoing  skele¬ 
ton  (yjt)-=v  _ 

Analogously,  we  construct  an  operator  eval(6)  that  constructs  a  rich  in¬ 
coming  representation  from  a  regular  one: 

(3.22)  =  eval(6)  cf>b. 

Other  types  of  translation  operators  involving  rich  representations  can  be 
constructed,  but  are  not  needed  for  the  purposes  of  this  paper. 
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4.  A  KERNEL-INDEPENDENT  FAST  SUMMATION  TECHNIQUE 

4.1.  Problem  formulation.  In  this  section,  we  consider  the  problem  of 
evaluating  the  sum 


N 

(4.1)  Ui  =  ^K(xi,Xj)qj,  z  =  l,...,  IV, 

j=i 


given  a  set  of  real  numbers  a  set  of  real  or  complex  numbers  (qj)f-i, 

and  a  kernel  K(x,y).  We  call  the  numbers  Xi  “locations”,  the  numbers 
qi  “charges”,  and  the  numbers  u,  “potentials”.  The  numbers  Xj  are  all 
contained  in  an  interval  Cl  called  the  “computational  domain” . 

We  focus  on  the  situation  where  the  potentials  (u,) are  to  be  evaluated 
for  a  sequence  of  charge  distributions  (qi)iLi  associated  with  a  single  set 
of  locations  (xi)^=l.  In  this  environment,  we  spend  a  moderate  amount 
of  computational  effort  on  optimizing  the  representations  and  the  various 
translation  operators  used  for  the  given  set  of  locations.  Once  this  has  been 
done,  each  potential  evaluation  can  be  performed  rapidly. 

The  summation  technique  presented  in  this  section  follows  the  same  tem¬ 
plate  as  earlier  versions  of  the  fast  multipole  method,  [8,  10,  7,  5] .  The  only 
difference  between  these  methods,  and  the  method  presented  in  this  section 
is  that  a  different  representation  for  potentials  and  sources  is  used.  In  Sec¬ 
tion  5,  a  further  acceleration  of  the  method  is  described.  This  acceleration 
is  obtained  by  compressing  the  interactions  between  adjacent  regions. 


4.2.  Tree  structure.  Given  a  computational  domain  Cl  =  [xjeft,  bright)  C 
M,  and  a  set  of  locations  (xn)^=1  C  Cl,  we  construct  an  adaptive  partitioning 
of  the  domain  into  subintervals  in  such  a  way  that  no  interval  contains  more 
than  No  locations,  for  some  given  (small)  integer  Nq.  We  do  this  via  a 
hierarchical  subdivision  process  in  which  any  interval  holding  more  than  No 
points  is  split  into  two  halves,  and  then  the  process  is  continued  with  each 
half  that  in  turn  contains  more  than  No  points.  If  a  =  [x“eft,  x“ight)  is  an 
interval  resulting  from  this  process,  then 


2- left  —  3-left  {j  1)  2  (bright  left)  j 
■^right  =  'Qeft  ~\~  j  2  (bright  2- left)  > 


for  some  number  l  =  0,1,2,...,  and  some  number  j  =  1,2,  The 

number  l  is  called  the  “level”  of  the  box  a  and  denotes  how  many  times  Cl 
has  been  cut  in  half  to  reach  a. 
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For  every  box  a,  we  construct  the  following  lists  of  other  boxes: 

Lchiidren(a):  For  a  non-leaf  box  a,  this  is  a  list  of  all  boxes  b  that  are 

contained  in  a,  and  that  are  separated  from  a  by  one  level  only. 

If  b  €  Lchiidren(a),  we  say  that  b  is  a  “child”  of  a ,  and  that  a  is 
a  “parent”  of  b.  For  a  leaf  box,  Lchiidren(a)  is  empty. 

Lciose(a):  For  a  leaf-box  a,  this  is  a  list  of  all  leaf  boxes  b  that  are  not 

well-separated  from  a.  For  a  non-leaf  box,  Lciose(a)  is  empty. 

£12(0):  The  list  of  all  boxes  b  on  the  same  level  as  a  that  are  well-separated 

from  a,  but  whose  parents  are  not  separated  from  the  parent  of  a. 
(This  list  is  known  as  the  “interaction  list”  in  the  original  FMM.) 

4.3.  Outgoing  and  incoming  representations.  For  any  box  a,  we  let 
the  vector  of  charges  inside  the  box  be  denoted  by  qa,  and  with  proj(a)  the 
matrix  defined  in  Section  3.3,  we  let  the  vector 

(4.2)  '0°  =  proj(a)  qa , 

denote  the  outgoing  representation  of  a.  Similarly,  we  let  ti“ar  denote  the 
potential  on  a  caused  by  all  charges  that  are  well-separated  from  a.  An 
incoming  representation  for  a  is  a  vector  <j)a  such  that 

(4.3)  uL  =  eval(o)  <£“  +  O(e), 
where  the  operator  eval(a)  is  defined  in  Section  3.3. 

Remark  14.  In  the  original  fast  multipole  method,  the  function  of  the 
vector  -ipa  was  performed  by  a  vector  of  multipole  coefficients  representing 
qa,  and  the  function  of  <f>a  was  performed  by  a  vector  of  expansion  coefficients 
for  u“ar  in  a  basis  of  harmonic  polynomials. 

4.4.  Hierarchical  construction  of  representations.  Both  the  incoming 
and  the  outgoing  representations  can  be  computed  recursively.  Specifically, 
if  a  is  any  non-leaf  box  a,  its  outgoing  representation  can  be  computed  from 
the  outgoing  representations  of  its  children  via  the  formula 

(4.4)  ipa  =  °°(a’ 6) 

fr^Lchildren  ( 

where  the  matrices  oo(a,  b)  are  defined  in  Section  3.5.  Similarly,  if  a  is  any 
box  other  than  the  root  box,  the  incoming  representation  <j)a  (representing 
the  potential  u“ar  caused  by  charges  in  all  boxes  that  are  well-separated  from 
a)  can  be  constructed  by  combining  the  incoming  potential  of  its  parent  b 
with  the  outgoing  potentials  of  all  boxes  in  the  interaction  list  of  a  via  the 
formula 

(4.5)  (j)a  =  ii(a,  b)  4>h  +  oi(a,  c) 

ceLi2(a) 

where  the  list  L^a)  is  defined  in  Section  4.2,  the  matrix  ii(a,  b)  is  defined  in 
Section  3.5,  and  the  matrices  oi(a,  c)  are  defined  in  Section  3.4. 
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4.5.  Pre-computation.  Given  a  set  of  locations  (xn);)=1  and  a  kernel  func¬ 
tion  K(x,y ),  the  pre-computation  stage  consists  of  the  following  steps: 

(1)  Divide  the  computational  domain  into  a  tree  structure,  as  described 
in  Section  4.2. 

(2)  Construct  the  lists  described  in  Section  4.2. 

(3)  Loop  over  all  leaf  boxes.  For  each  box  a,  construct  the  regular  outgo¬ 
ing  and  incoming  skeletons.  Simultaneously,  determine  the  matrices 
proj(a)  and  eval(a),  described  in  Section  3.3. 

(4)  Loop  over  all  non-leaf  boxes,  going  from  finer  to  coarser  levels.  For 
each  box  o,  construct  the  regular  outgoing  and  incoming  skeletons  by 
merging  the  skeletons  of  its  children  Simultaneously,  determine 
the  matrices  oo(a,  6;),  and  ii(6j,  a),  as  described  in  Section  3.5. 

(5)  Loop  over  all  boxes  a,  and  then  over  all  elements  b  in  the  list  L2(a). 
For  each  such  pair  (a,  b ),  construct  the  translation  operator  oi(a,  b ), 
as  described  in  Section  5.3. 

The  total  cost  of  the  steps  described  above  depends  on  the  kernel  and  on 
the  charge  distribution.  If  the  kernel  satisfies  Assumption  I  in  Section  3.6, 
then  the  computational  cost  is  typically  either  0(N),  or  0(N  log  N),  and 
the  amount  of  storage  required  is  typically  O(N),  cf.  Remark  15. 

4.6.  A  general  fast  multipole  method.  We  have  now  assembled  the  tools 
for  computing  the  sum  (4.1)  through  two  passes  through  the  hierarchical 
tree;  one  upwards,  and  one  downwards. 

(1)  Sweep  over  all  leaf  boxes  a.  For  each  box,  construct  its  outgoing 
representation  from  the  values  of  the  charges  inside  it,  cf.  (4.2): 

ifa  =  proj(a)  qa. 

(2)  Sweep  over  all  non-leaf  boxes  a,  going  from  finer  to  coarser  levels. 
For  each  box  a,  construct  its  outgoing  representation  by  merging  the 
outgoing  representations  of  its  children,  cf.  (4.4): 

-0“=  oo  {a,b)tpb. 

b€Lchildren(a) 

(3)  Set  4>r  =  0  for  the  root  box  r.  Then  loop  over  all  boxes  (including  the 
root  box),  going  from  coarser  to  finer  levels.  For  each  box  a,  form  its 
incoming  representation  by  combining  the  incoming  representation 
of  its  parent,  box  b,  with  the  contributions  from  the  boxes  in  L2(a) 
(its  “interaction  list”),  cf.  (4.5): 

0“  =  ii(a,  b)  4>b  +  ^  oi(a,  c)  0C. 

ceLi2(a) 

(4)  Sweep  over  all  leaf  nodes  a.  For  each  node,  form  the  potential  ua 
by  evaluating  the  incoming  representation  and  directly  adding  the 


13 


contributions  from  the  charges  inside  a  and  in  all  boxes  that  are  not 
well-separated  from  a,  cf.  (4.3): 

ua  =  eval(a)  &>a  +  direct(a,  a)  qa  +  direct(a,  c)  qc. 

ceLdose  ( l-! ) 

For  kernels  satisfying  Assumption  I,  the  computational  cost  of  the  steps 
described  in  this  subsection  is  typically  O(N). 

Remark  15.  It  is  possible  to  construct  (highly  non-uniform)  charge  distri¬ 
butions  for  which  the  computational  cost  of  the  steps  described  in  Sections 

4.5  and  4.6  exceed  O(N).  A  discussion  of  this  phenomenon,  and  a  mod¬ 
ification  to  the  scheme  that  makes  it  always  retain  0(N )  complexity,  are 
described  in  [19]. 

5.  An  accelerated  kernel  independent  fast  multipole  method 

5.1.  Outline.  In  this  section,  we  again  consider  the  problem  of  rapidly  eval¬ 
uating  the  sum  (4.1).  We  describe  a  technique  for  doing  so  that  is  similar 
to  the  technique  described  in  Section  4,  but  with  the  difference  that  even 
interactions  between  adjacent  boxes  are  compressed.  To  enable  this  addi¬ 
tional  compression,  we  keep  track  of  four  representations  for  each  leaf  box; 
the  regular  outgoing  and  incoming  ones  described  in  Section  3.3,  and  also 
the  rich  outgoing  and  incoming  representations  described  in  Section  3.7. 
The  rich  representations  are  used  exclusively  for  the  purpose  of  evaluating 
interactions  involving  at  least  one  leaf  box. 

The  asymptotic  cost  for  the  algorithm  described  in  this  section  scales  in 
the  same  way  with  N  as  the  cost  for  the  algorithm  described  in  Section 

4.6  (typically,  0(N log N)  for  pre-computation,  and  O(N)  for  evaluation). 
However,  the  constants  involved  are  smaller. 

5.2.  Tree  structure.  The  accelerated  algorithm  uses  the  same  tree  struc¬ 
ture  that  was  described  in  Section  4.2.  In  addition  to  the  lists  described  in 
that  section,  the  accelerated  algorithm  also  uses  the  following  three  lists: 

Li(a):  For  a  leaf  box  a,  this  is  a  list  of  the  leaf  boxes  that  directly 
border  a.  For  a  non-leaf  box,  Li(a)  is  empty. 

1,3(0):  For  a  leaf  box  a,  this  is  a  list  of  all  boxes  on  finer  levels  than  a 
that  are  separated  from  a  but  whose  parents  are  not  separated 
from  the  parent  of  a.  For  a  non-leaf  box  a,  La(a)  is  empty. 

1,4(0):  The  dual  of  L3.  In  other  words,  b  €  L4 (a)  if  and  only  if  a  6  1^(6). 

5.3.  Translation  operators.  The  algorithm  described  in  Section  4.6  uti¬ 
lizes  a  single  type  of  outgoing-to-incoming  translation  operator.  This  op¬ 
erator  maps  a  regular  outgoing  representation  tpa  to  a  regular  incoming 
representation  d>b  for  all  pairs  (a,  b)  such  that  a  6  L2  (b).  This  translation 
operator  is  also  used  in  the  accelerated  algorithm  described  in  this  section, 
we  label  it  oi2  (a,  6).  The  accelerated  algorithm  requires  three  additional 
outgoing-to-incoming  translation  operators:  For  each  pair  (a,  b)  such  that 
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b  €  Li  (a),  the  operator  04  (a,  b)  maps  a  rich  representation  to  a  rich  rep¬ 
resentation.  For  each  pair  (a,  b)  such  that  b  €  Ls(a),  the  operator  oi3  (a,b) 
maps  a  regular  representation  to  a  rich  representation.  For  each  pair  (a,  b) 
such  that  b  €  L4(a),  the  operator  04  (a,b)  maps  a  rich  representation  to  a 
regular  representation.  Each  matrix  oi j(a,b)  is  the  restriction  of  the  origi¬ 
nal  matrix  of  interaction  direct  (a,  6)  to  the  relevant  outgoing  and  incoming 
skeletons  for  a  and  b,  respectively. 

5.4.  Pre-computation.  The  pre-computation  for  the  accelerated  algorithm 
is  very  similar  to  the  pre-computation  described  in  Section  4.5.  The  differ¬ 
ence  is  that  for  all  leaf  nodes,  we  compute  both  the  regular  and  the  rich 
skeletons  (and  the  corresponding  operators  eval  and  proj).  We  also  de¬ 
termine  the  additional  lists  Li,  L3,  and  L4,  as  well  as  the  corresponding 
translation  operators  04,  04,  and  04. 

5.5.  Potential  evaluations.  After  the  particle  locations  (xn)^=l,  and  the 
kernel  K(x,y)  have  been  fixed,  and  the  pre-computation  described  in  Sec¬ 
tion  5.4  has  been  completed,  the  following  steps  will  compute  the  vector  of 
potentials  (un)^=1  from  a  vector  of  charges  (qn)n=i  >  (c/-  (4-1)),  to  within  a 
precision  of  computations  e: 

(1)  Loop  over  all  leaf  boxes.  For  each  box  a,  compute  'I'a  and  then  ipa: 

'Fa  =  Proj  (a)  qa,  and  then  ipa  =  proj  (a)  \I/a. 

(2)  Loop  over  all  non- leaf  boxes,  going  from  finer  levels  to  coarser.  For 
each  box  a,  compute  -0“  by  combining  the  outgoing  representations 
of  its  children, 

v  =  °°(a> fe)  ^ b ■ 

fr^Lchildren  (a) 

(3)  Loop  over  all  leaf  boxes.  For  each  box  a,  add  up  the  contributions 
from  the  boxes  in  Li(a), 

$1=  oi x(a,6)^. 

fcGLi  (a) 

(4)  Loop  over  all  boxes.  For  each  box  a,  add  up  the  contributions  from 
the  boxes  in  L2(a), 

4>2=  Y!  oi2 (a,b)ipb. 

b£L2(a) 

(5)  Loop  over  all  leaf  boxes.  For  each  box  a,  add  up  the  contributions 
from  the  boxes  in  Ls(a), 

$3=  oi  3(a,b)ipb: 

fc£L3(a) 
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(6)  Loop  over  all  boxes.  For  each  box  a,  add  up  the  contributions  from 
the  boxes  in  L^a), 

<t>4  =  E 
beh4(a) 

(7)  Set  (f>r  =  0  for  the  root  box  r.  Then  loop  over  all  non-leaf  boxes 
(including  the  root  box),  from  coarser  levels  to  finer.  For  each  box 
a,  and  for  each  child  b  of  a,  construct  the  incoming  regular  represen¬ 
tations  for  b: 

$  =  4* 2  +  ^4  +  ii(2>,  a)  cj)a . 

(8)  Loop  over  all  leaf  boxes.  For  each  box  a,  construct  its  rich  incoming 
representation  by  adding  the  various  contributions: 

+  eval(a)  4>a. 

(9)  Loop  over  all  leaf  boxes.  For  each  box  a,  construct  ua  by  interpolat¬ 
ing  $Q  and  adding  the  contributions  from  qa: 

ua  =  Eval(o)  +  direct(a,  a)qa. 

Remark  16.  The  vectors  <f> 2,  $j,  <hf,  and  $3  are  never  stored;  they  are 
added  directly  to  either  cpa  or  $>a  as  they  are  computed. 

6.  Numerical  examples 

The  numerical  algorithm  described  in  Section  5  has  been  implement  in 
FORTRAN  77  and  tested  on  several  model  problems.  In  this  section,  we 
summarize  the  results  from  tests  involving  three  different  kernels: 

Example  1:  This  example  involves  the  evaluation  of  the  sum 
N 

(6.1)  um=  xn\)qn,  for  m  =  l,...,iV, 

n—  1 
n^m 

where  the  points  (xn)^=1  were  drawn  from  a  uniform  random  distribution 
on  the  interval  [0,  1]. 

Example  2:  This  example  involves  the  evaluation  of  the  sum 

n\  V-''  Pk+l(xm)Pk(xn)  Pk{pCm)Pk+l  ixn)  f  i  AT 

(6.2)  um  =  )  — — - ^ ^ - - -  — -  qn ,  for  m  =  1, . . . ,  jV, 

/  wJ  nr  _  nr 

71=1 

where  pk  is  the  fc’th  Legendre  polynomial,  and  the  points  (xn)^=1  are  the 
Gaussian  nodes  on  the  interval  [—1,  1].  The  sum  (6.2)  arises  in  evaluating 
orthogonal  projections  onto  the  space  of  order  k  polynomials  in  L2([— 1, 1]), 
see  [13].  Similar  sums  are  encountered  in  the  construction  of  fast  algorithms 
for  the  harmonic  expansions  on  the  sphere,  in  the  FMM  for  the  Helmholtz 
and  Maxwell  equations,  etc.,  see  [3,  20].  In  the  numerical  examples  reported, 
k  was  one  third  of  N  (rounded  to  the  nearest  integer). 
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Matrix  elements 
computed  on  the  fly 


Matrix  elements 
pre-computed 


Example  1 
Example  2 
Example  3 


11 

20 

14 


Table  1.  Breakeven  points  extrapo 


52 

60 

94 

ated  from  Figure  2. 


Example  3:  This  example  involves  the  evaluation  of  the  sum 


(6.3) 


N 


=  E 


sin(a(xm  -  xn )) 


n=  1 


Qni 


for  m  =  1, . . . ,  N, 


where  (xn)„=l  are  equispaced  nodes  in  the  interval  [—1, 1].  Such  sums  occur 
frequently  in  signal  processing  and  many  other  areas,  see  [2,  6,  21].  The 
parameter  a  was  chosen  so  that  there  were  5  nodes  per  wavelength,  i.e. 
a  =  nN/5. 


The  CPU  times  required  for  the  accelerated  matrix-vector  multiplication 
in  the  three  examples  are  given  in  Figure  2.  The  experiments  were  carried 
out  on  a  3.2GHz  Pentium  IV  desktop  with  2Gb  of  RAM.  All  calculations 
shown  were  carried  out  with  a  requested  accuracy  of  e  =  10-10.  Detailed 
CPU  time  requirements  are  given  in  Tables  2,  3,  and  4  in  Appendix  A. 
These  tables  also  report  the  memory  requirements  of  the  algorithm,  as  well 
as  the  time  required  for  the  pre-computing  step. 

For  comparison,  Figure  2  also  reports  the  CPU  times  required  for  un¬ 
compressed  matrix-vector  multiplies,  both  for  the  case  where  the  matrix 
elements  are  pre-computed  and  stored,  and  the  case  where  the  matrix  ele¬ 
ments  are  computed  on  the  fly.  The  break-even  points  obtained  by  extrap¬ 
olating  the  lines  in  Figure  2  are  given  in  Table  1.  Finally,  Figure  2  reports 
the  CPU  time  requirement  for  an  FFT  of  length  N  (with  equispaced  nodes). 
We  note  that  for  large  problems,  the  matrix-vector  multiply  reported  here 
is  about  5-10  times  slower  than  an  FFT. 

In  order  to  investigate  the  dependence  of  the  CPU  time  requirement  on 
the  requested  accuracy,  the  matrix-vector  multiplication  in  Example  1  was 
carried  out  for  e  =  10-3-5,  10~7,  and  10-14.  The  resulting  CPU  times  are 
reported  in  Figure  3.  This  table  also  specifies  the  CPU  times  required  for 
the  pre-computational  stage.  We  remark  that  no  attempt  whatsoever  was 
made  to  optimize  this  part  of  the  code,  and  its  CPU  time  requirements  can 
be  reduced  dramatically. 


7.  Conclusions 

This  paper  describes  a  fast  multipole  method  for  the  rapid  evaluation 
of  sums  of  the  form  (1.1).  In  one  and  two  dimensions,  the  computational 
complexity  of  this  method  is  O(N). 
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FIGURE  2.  The  CPU  times  (in  seconds)  required  for  evalu¬ 
ating  the  sums  (6.1),  (6.2),  and  (6.3)  versus  problem  size,  N. 

The  markers  ’x’,  ’+’,  and  ’o’  label  the  three  cases.  The  dot¬ 
ted  lines  mark  the  times  required  for  uncompressed  matrix- 
vector  multiplies,  the  solid  lines  mark  the  times  required  for 
the  algorithm  of  Section  5.  The  times  for  a  matrix- vector 
multiply  with  a  stored  matrix  are  marked  with  ’o’  and  the 
times  for  an  FFT  are  marked  ’v’- 

The  method  does  not  use  any  analytic  expansions  of  the  kernel;  instead, 
it  numerically  compresses  the  kernel  in  a  pre-processing  stage  whose  com¬ 
putational  cost  is  0(N  log  N).  The  combined  cost  of  the  pre-computation, 
and  a  single  potential  evaluation  using  the  present  scheme  is  larger  than 
the  cost  of  a  single  evaluation  using  some  existing  pre-computation  free  fast 
summation  techniques.  However,  if  the  evaluation  is  to  be  performed  for 
a  sequence  of  different  charge  distributions  (on  a  fixed  set  of  charge  loca¬ 
tions),  then  the  current  method  outperforms  most  existing  methods,  with 
the  important  exception  of  convolutional  sums  that  can  be  evaluated  using 
the  FFT.  Moreover,  since  the  current  scheme  does  not  explicitly  rely  on 
analytical  properties  of  the  kernel,  it  is  applicable  in  many  environments  in 
which  pre-computation  free  methods  are  not  available. 

One  application  that  seems  particularly  well  suited  for  the  fast  summation 
technique  of  this  paper  concerns  the  rapid  computation  of  the  singular  value 
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Figure  3.  The  CPU  time  required  for  evaluating  the  sum  in 
(6.1)  with  s  =  10“3-5,  10~7,  and  10-14,  using  the  algorithm 
of  Section  5,  plotted  against  problem  size,  N.  The  solid  lines 
give  the  time  for  a  matrix-vector  multiply,  while  the  dotted 
lines  give  the  time  required  for  pre-computation. 

decomposition  of  a  matrix.  A  very  fast  algorithm  for  this  task  based  on  a 
divide-and-conquer  technique  is  described  in  [11].  A  core  observation  of  [11] 
is  that  the  seemingly  expensive  task  of  updating  a  unitary  matrix  can  - 
surprisingly  -  be  accelerated  using  the  fast  multipole  method.  However,  the 
break-even  point  of  previous  versions  of  the  fast  multipole  method  made  the 
algorithm  of  [11]  competitive  only  for  very  large  matrices.  Research  into 
combining  the  fast  summation  technique  of  this  paper  with  the  algorithm 
of  [11]  is  currently  under  way. 

Acknowledgments:  The  authors  wish  to  thank  M.  Tygert  for  constructive 
discussions. 


Appendix  A.  Computational  results 

This  appendix  contains  detailed  statistics  for  the  computational  experi¬ 
ments  summarized  in  Section  6.  The  numbers  upon  which  Figure  2  is  based 
can  be  found  in  Tables  2,  3,  and  4.  The  numbers  upon  which  Figure  3  is 
based  can  be  found  in  Table  5.  Table  6  provides  the  CPU  times  required 
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for  an  uncompressed  matrix-vector  multiply,  and  a  length- iV  FFT.  The  im¬ 
plementation  of  the  FFT  is  the  one  found  in  FFTPACK. 

The  following  numbers  are  given  for  each  experiment: 

N  Problem  size. 

e  Requested  accuracy. 

Emax  Maximum  error  (see  Remark  17). 

Erms  Root  mean  square  error  (see  Remark  17). 

tpre  CPU  time  required  for  pre-computation. 

tevai  CPU  time  required  for  an  accelerated  matrix- vector  multiply. 

Mkeep  Amount  of  memory  required  to  store  the  compressed  operator. 

Mpre  Amount  of  memory  required  for  pre-computation. 

tuncomp  CPU  time  required  for  an  uncompressed  matrix-vector  multiply. 

tmatvec  CPU  time  required  for  multiplying  a  stored  matrix  by  a  vector. 

CPU  time  required  for  a  length-iV  (equispaced)  FFT. 

All  CPU  times  are  given  in  seconds,  and  all  memory  requirements  are  given 
in  terms  of  storage  for  double  precision  reals.  The  memory  required  for 
applying  the  operator  to  a  vector  is  not  reported  since  it  is  far  smaller  than 
Mkeep  (asymptotically,  it  is  a  couple  of  reals  per  node). 


Remark  17.  We  report  both  the  maximum  error  Emax  and  the  root  mean 
square  error  Erms.  Letting  (un)^=1  denote  the  result  of  an  uncompressed 
matrix-vector  multiply,  and  letting  (u'n)^=l  denote  the  result  of  an  acceler¬ 
ated  matrix- vector  multiply,  we  have 


/  \  i  \  171  _  m^Xl<n<Ar  |^n  un  I  TP  _ 

(A.l)  £max-  N  ,  and  Eims  - 

(l/Aj  Ljn=  1  lU”l  V  Z^n= 1  . 

In  each  of  the  experiments  reported,  the  charges  qn  were  drawn  from  a 
uniform  random  distribution  on  [—1,1]. 


u' 


IeL iK 


X)2 


jn= 


\llr 
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N 

tpre/N 

■E'max 

-E'rms 

^eval/  N 

•^keep/  N 

Mpre/N 

^uncomp  /  N 

1000 

1.8E-04 

2.3E-10 

3.0E-11 

5.0E-07 

9.9E+01 

1.4E+03 

7.0E-05 

2500 

2.6E-04 

2.5E-10 

2.6E-11 

6.8E-07 

1.0E+02 

7.5E+02 

1.6E-04 

5000 

2.5E-04 

3.9E-10 

3.1E-11 

7.0E-07 

1.0E+02 

4.9E+02 

3.1E-04 

10000 

3.2E-04 

2.4E-10 

2.9E-11 

7.1E-07 

1.0E+02 

3.3E+02 

6.2E-04 

25000 

3.0E-04 

2.4E-10 

2.7E-11 

7.1E-07 

1.1E+02 

2.1E+02 

1.5E-03 

50000 

3.6E-04 

3.0E-10 

2.5E-11 

7.1E-07 

1.1E+02 

1.7E+02 

3.1E-03 

100000 

3.4E-04 

2.0E-10 

2.4E-11 

7.1E-07 

1.1E+02 

1.4E+02 

6.2E-03 

Table  2.  Computational  results 

for  Example  1,  with 

e  =  10~10 

N 

tpre/  N 

-E'max 

•firms 

^eval/-^ 

Mkeep/N 

MpTe/N 

^uncomp  /  N 

1000 

2.2E-04 

1.5E-10 

8.9E-13 

6.0E-07 

1.1E+02 

1.3E+03 

4.0E-05 

2500 

2.9E-04 

1.6E-09 

3.6E-12 

7.6E-07 

1.1E+02 

8.0E+02 

9.2E-05 

5000 

2.6E-04 

1.5E-09 

1.6E-12 

7.6E-07 

1.2E+02 

5.4E+02 

1.8E-04 

10000 

2.9E-04 

2.0E-09 

1.0E-12 

7.5E-07 

1.2E+02 

3.7E+02 

3.6E-04 

25000 

3.9E-04 

4.6E-09 

1.4E-12 

7.7E-07 

1.1E+02 

2.4E+02 

9.1E-04 

50000 

4.3E-04 

2.4E-09 

5.5E-13 

7.7E-07 

1.1E+02 

1.8E+02 

1.9E-03 

100000 

5.5E-04 

6.4E-08 

2.5E-12 

7.7E-07 

1.1E+02 

1.5E+02 

3.7E-03 

Table  3.  Computational  results 

for  Example  2,  with 

£  =  10~10 

N 

^pre/  N 

fi'max 

firms 

^eval/-^ 

Mfaep/N 

Mpre/N 

^uncomp/  N 

1000 

7.0E-04 

9.1E-10 

1.6E-10 

1.1E-06 

1.9E+02 

1.4E+03 

9.0E-05 

2500 

8.3E-04 

2.6E-09 

1.9E-10 

1.2E-06 

1.9E+02 

8.0E+02 

2.2E-04 

5000 

1.1E-03 

3.6E-09 

3.7E-10 

1.3E-06 

2.0E+02 

5.7E+02 

4.3E-04 

10000 

1.2E-03 

4.6E-09 

2.0E-10 

1.3E-06 

2.0E+02 

4.2E+02 

8.6E-04 

25000 

1.4E-03 

5.8E-09 

4.6E-10 

1.2E-06 

2.0E+02 

3.0E+02 

2.2E-03 

50000 

1.4E-03 

4.7E-09 

1.2E-10 

1.2E-06 

2.0E+02 

2.6E+02 

4.3E-03 

100000 

1.7E-03 

2.2E-09 

6.8E-11 

1.2E-06 

2.0E+02 

2.3E+02 

8.7E-03 

Table  4.  Computational  results  for  Example  3,  with  e  =  10  10. 
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